DOCUMENT RESUME

ED 091 750                                        CS 201 355

AUTHOR       Diederich, Paul B.
TITLE        Cooperative Preparation and Rating of Essay Tests.
PUB DATE     66
NOTE         17p.; Reprinted from "English Journal," April 1967.
             Paper presented at the Houston meeting of the
             National Council of Teachers of English. See related
             documents CS 201 320-375

EDRS PRICE   MF-$0.75 HC-$1.50 PLUS POSTAGE
DESCRIPTORS  *Composition (Literary); *Educational Research;
             *Evaluation Methods; Intermediate Grades; Language
             Arts; *Measurement Instruments; Post Secondary
             Education; Research Tools; Resource Materials;
             Written Language
IDENTIFIERS  *The Research Instruments Project; TRIP

ABSTRACT



To evaluate the quality of written compositions, 
researchers at Educational Testing Service developed the Composition 
Evaluation Scales (CES), after factor-analytic studies of the reasons 
teachers gave for their judgments of compositions. This is a set of 
eight scales: ideas, organization, wording, flavor, usage, 
punctuation, spelling, and handwriting. Each scale is marked on a 
five-point line — with the scales of ideas and organization receiving 
double weight — yielding a total score of 50. The CES is most 
appropriately used with expository papers on a set topic. [This 
document is one of those reviewed in The Research Instruments Project 
(TRIP) monograph "Measures for Research and Evaluation in the English 
Language Arts" to be published by the Committee on Research of the 
National Council of Teachers of English in cooperation with the ERIC 
Clearinghouse on Reading and Communication Skills. A TRIP review 
which precedes the document lists its category (Writing), title,
authors, date, and age range (intermediate to postsecondary), and
describes the instrument's purpose and physical characteristics.]
(RB) 






NCTE Committee on Research 



The Research Instruments Project (TRIP) 


The attached document contains one of the measures reviewed 
in the TRIP committee monograph titled:

Measures for Research and Evaluation
in the English Language Arts



TRIP is an acronym which signifies an effort to abstract 
and make readily available measures for research and evalua- 
tion in the English language arts. These measures relate to 
language development, listening, literature, reading, standard 
English as a second language or dialect, teacher competencies, 
or writing. In order to make these instruments more readily 
available, the ERIC Clearinghouse on Reading and Communication 
Skills has supported the TRIP committee sponsored by the Committee 
on Research of the National Council of Teachers of English and 
has processed the material into the ERIC system. The ERIC
Clearinghouse accession numbers that encompass most of these
documents are CS 201 320 - CS 201 375.



TRIP Committee: 

W.T. Fagan, Chairman 
University of Alberta, Edmonton 

Charles R. Cooper 
State University of New York 
at Buffalo 

Julie M. Jensen 

The University of Texas at Austin

Bernard O'Donnell
Director, ERIC/RCS

Roy C. O'Donnell
The University of Georgia
Liaison to NCTE Committee
on Research



NATIONAL COUNCIL OF TEACHERS OF ENGLISH

1111 KENYON ROAD
URBANA, ILLINOIS 61801



Category: Writing 

Title: E.T.S. Composition Evaluation Scales
Authors: Paul Diederich, John French, Sydell Carlton
Age Range: Intermediate to Post-Secondary
Description of Instrument:

Purpose: To evaluate the quality of written compositions.

Date of Construction: 1961

Physical Description: The CES was developed by researchers at
Educational Testing Service after factor-analytic studies of the reasons
teachers gave for their judgments of compositions. It is a set of eight
scales: ideas, organization, wording, flavor, usage, punctuation, spelling,
and handwriting. Each scale is marked on a five-point line--with the
scales of ideas and organization receiving double weight--yielding a total
score of 50. In the full report where CES appears, the high, middle, and
low points on each scale are described in detail.

The CES is most appropriately used with expository papers on a set 
topic. It can be compared with the London Scales for creative or imaginative 
writing reviewed in this monograph.

Validity, Reliability, and Normative Data:

The validity of CES resides in its basis in a study of teachers'
reasons for their judgments of compositions. Like all rating scales it
has high face and "content" validity since it is used with whole pieces
of written discourse.

Diederich claims that with practice teacher-raters can achieve a
reliability of .90 for a cumulative total of eight ratings, two each on
four different papers by the same writer. In the reports noted below
Diederich outlines a school-wide cooperative rating scheme based on the
CES.

Ordering Information: 
EDRS 

Related Documents: 

Diederich, Paul B. "Cooperative Preparation and Rating of Essay
Tests," English Journal, 56 (April 1967), 573-584, 590.

Diederich, Paul B. "How to Measure Growth in Writing Ability,"
English Journal, 55 (April 1966), 435-449.



ENGLISH JOURNAL 

The Official Journal of the Secondary Section of 
National Council of Teachers of English 

Editor: Richard S. Alm 

University of Hawaii 



Volume 56                 April 1967                 Number 4

Wallace Stevens-"It Must Be Human"                 525  Joseph N. Riddel
A Note to the Lady with Whom I Dined
  at the Annual Teachers' Convention (Verse)       534  Charles Rathbone
The Hero Within                                    535  Marcia Brown
Triumphant-Then Doomed (Verse)                     541  Alice Braatz
Proportioning in Fiction: [illegible]              552  [illegible]
Introducing Students                               561  [illegible]
Safe Is Not Always Best                            562  [illegible]
Color Him Red                                      564  Evelyn W. Hall
The Discipline and Freedom [illegible] Teacher     566  [illegible]
Cooperative Preparation and
  Rating of Essay Tests                            573  Paul B. Diederich
The New English in Our School                      585  Margaret Kemper Bonney
The New English-A Luddite View                     591  Charles A. Campbell
The Detroit Public Schools Present
  English on Television                            596  Ethel Tincher
Teaching Creative Writing to
  Emotionally Handicapped Adolescents              603  Lenore Mussen



PERMISSION TO REPRODUCE THIS COPYRIGHTED
MATERIAL HAS BEEN GRANTED BY
National Council of Teachers of English
TO ERIC AND ORGANIZATIONS OPERATING
UNDER AGREEMENTS WITH THE NATIONAL INSTITUTE
OF EDUCATION. FURTHER REPRODUCTION
OUTSIDE THE ERIC SYSTEM REQUIRES
PERMISSION OF THE COPYRIGHT OWNER.



Cooperative Preparation and Rating 
of Essay Tests 



Paul B. Diederich 

Senior Research Associate 
Educational Testing Service
Princeton, New Jersey



My real topic is the improvement of
measurement in education by cooperative
action of departments or teaching
teams, but I shall give particular attention
to the cooperative preparation
and rating of essay tests and examinations.
I believe that neither the grading of essays
nor any other measurement performed
by teachers is likely to improve
until responsibility for measurement of
the most important objectives in each
field has been transferred from individual
teachers to the department or team.

For the past 32 years I have been
visiting schools in many parts of the
country, trying to help teachers with
their problems of testing, grading, record-keeping,
and reporting. Although this is
not a common occupation, I have not
been alone in this endeavor. Hundreds
of courses in tests and measurements
have been offered, summer institutes and
workshops on evaluation have multiplied,
dozens of books and thousands of articles
have been written, and hundreds of communities
have brought in consultants on
measurement to provide in-service training.
In addition to these outside influences,
almost every school district tries
to improve its report cards about once
every ten years. What has been the
result?

Editor's Note: This paper was presented at
the Houston meeting of NCTE, November

As I visit schools now and examine
everything that teachers are doing to appraise
what their students have learned,
I cannot point to a single important
change from the measurement practices
of 1935, when I first began visiting the
30 school systems that were involved in
the Eight-Year Study. Each teacher still
makes up his own tests and examinations
without the help or criticism of his colleagues,
and it is still uncommon for two
or more teachers to grade them independently.
These tests and examinations
rarely get at anything more than
knowledge and skills. When they try to
get at anything like creativity, imagination,
appreciation, critical thinking, or
attitudes, they usually come a cropper.
The reliability of these tests is rarely
computed, and item analysis is almost unknown.
Students are still marked on almost
everything they do, and teachers
have no idea that there is any way to find
out how much measurement of a given
objective is enough. At the end of each
marking period, they add together different
kinds of measures of different
objectives and translate the "average"
into a single course grade. Although this
process seldom has any rational or mathematical
foundation, it probably causes
teachers, students, and parents more
trouble, worry, and heartache than any
other aspect of school work. No competent
investigator would use these
grades as evidence of what the students
had learned, but they pass as coin of the
realm even though we all know that
many of them are worthless. This was
the situation in 1935, and that is the
situation today. Why have the time, effort,
and ingenuity devoted to improving
measurement by teachers produced not
one single change of any consequence
in 32 years?

The only common characteristic of
all these efforts at improvement that I
can think of is that they left the problem
where it had been to start with: namely,
in the laps of individual teachers. My
own conclusion is that the individual
approach has clearly failed, and there is
no reason to suppose that it will succeed
any better in the next 32 years. I see no
hope for any significant improvement
until the individual approach is abandoned,
and measures of the four, five, or
six most important objectives in each
field are prepared, reviewed, revised, administered,
scored, reported, and analyzed
by cooperative action of departments or
teaching teams.

This leaves room for two other types
of evaluation that should continue to be
handled by individuals: what I call "instructional
evaluation" by teachers and
"self-evaluation" by students. Instructional
evaluation includes everything
that a teacher does in class, in conference,
and in grading his own tests and assignments
to keep an eye on how things are
going. Its principal function is the
guidance and reinforcement of his teaching.
I have some doubt that it should
ever enter or affect the permanent records
of students, but this may be going
too far. Self-evaluation by students is
not as well understood or used as instructional
evaluation, but it can be
argued that students should have a recognized
part in evaluating their own development
until they are almost independent
of external evaluation--as they
must be most of the time in adult life.
In other words, one of the objectives of
each field must be to help them become
more competent and more responsible
evaluators of themselves. Both instructional
evaluation by teachers and self-evaluation
by students are outside the
jurisdiction of the department or team.
But there remain usually four, five, or six
major, continuing objectives of instruction
in each field that are better regarded
as the collective responsibility of the department
than as the individual responsibility
of each teacher. It is my contention
that responsibility for the measurement of
these objectives should be transferred
from individual teachers to the department
or team, and that immediate and
striking improvements over the measurement
practices of individual teachers will
result.

BEFORE I tell you how to do it, I
should say something about five objections
or questions that occur to everyone
immediately, and that may lead you
to reject this scheme out-of-hand and to
stop reading at this point. The first is
that departmental examinations will
drastically curtail freedom of teaching.
Suppose one teacher prefers to introduce
the detective story through a short story
by Conan Doyle while another prefers
Edgar Allan Poe. Will both have to use
the same story to prepare for the departmental
examination?

Not at all. Our practice is to set for
the examination an entirely different
story--for example, one by Carter Dickson
or Rex Stout--that all teachers are
forbidden to discuss and that students
have to read on their own before the
examination. They learn how to interpret
and analyze such a story through
any example of the detective story that
each teacher likes to teach, but they show
that they know how to do it through a
somewhat easier story that they have to
figure out for themselves. This practice
avoids the danger that literary competence
can sometimes be counterfeited
by ability to recall and reproduce--to
"parrot back"--the teacher's analysis and
interpretation. Since questions are written
by several teachers and reviewed by
others, it also avoids the danger that the
examination may be dominated by the
point of view of a single teacher. Such
domination requires an ability to diagnose
the teacher that is hard on students with
independent minds. Occasionally one hears
them saying that they answered as they
did, not because they believe it but because
that is the sort of thing their teacher
likes. In the situation I have described,
they do not know who wrote the question
or who will grade their answers, and
so the safest thing to do is to write what
they really think.

A second objection to departmental examinations
that may be in your minds
runs something like this: "If I had to
work on an examination with those
feebleminded buzzards who make up my
department, I'd resign." I know that feeling,
but in my experience it makes no
difference. You don't have to like or trust
your colleagues to prepare a good examination
or other measure by a division
of labor. My hunch is that the result is
somewhat better if not too much brotherly
love prevails in a department, for then
the parts of the examination will be
prepared with greater care and criticized
more rigorously. The worst measures I
have seen were produced by a team in
which there was so much togetherness
that they thought they had to meet every
day and gabble until the questions somehow
got written. Very little talk is
needed. There is one meeting to agree
on an outline (under the firm guidance
of the department head) and to divide up
the work of preparing the questions.
These are circulated in photocopies, and
all members of the committee write in
their objections and suggestions for improvement.
After some time for revision,
the department head confers with each
author to see whether all reasonable objections
have been satisfied. It is seldom
necessary or wise to call a second meeting
to consider the revised examination. All
proposed changes are settled in these
conferences. If any part of the examination
has to be graded, no one need fear
that his enemies will stick a knife into his
students, since the papers are identified
only by code numbers, usually chosen at
random by the students themselves. The
papers are also graded independently by
two different teachers, and whenever
there is a substantial difference between
their grades, the disputed papers are referred
to a small committee of the most
trusted readers.
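
The anonymous double reading described above can be sketched in a few lines of Python. This is an illustration only, not part of the procedure as reported: the function name assign_readers, the reader labels, and the rotation rule for choosing the second reader are all assumptions. The article itself specifies only that papers carry code numbers and that each is graded independently by two different teachers.

```python
def assign_readers(paper_codes, readers):
    """Give every coded paper two different readers, spreading the load evenly.

    paper_codes: code numbers chosen by the students (hypothetical strings here).
    readers:     teacher identifiers (hypothetical).
    Returns a dict mapping each code to a (first_reader, second_reader) pair.
    """
    n = len(readers)
    if n < 2:
        raise ValueError("double reading needs at least two readers")
    assignments = {}
    # Sorting by self-chosen code mixes classes and grades together.
    for i, code in enumerate(sorted(paper_codes)):
        # The offset is always between 1 and n - 1, so the second reader
        # can never be the same person as the first reader.
        offset = 1 + (i // n) % (n - 1)
        assignments[code] = (readers[i % n], readers[(i + offset) % n])
    return assignments


if __name__ == "__main__":
    codes = ["483920", "105577", "920014", "661203", "274819", "338846"]
    for code, pair in assign_readers(codes, ["A", "B", "C"]).items():
        print(code, pair)
```

Because the names of the writers stay on separate slips, nothing in this table connects a reader to a student he knows.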

A third objection is sure to arise at
this point: "We are too busy already,
and work on these departmental exams
will impose an additional burden on us
that we simply cannot accept." You may
not believe me until you get involved in
cooperative measurement, but I will stake
my reputation on the promise that after
you learn how to do it and cut out the
busywork that it makes unnecessary, it
will reduce the whole task of measurement,
grading, and record-keeping to a
properly subordinate role. The principle
of a division of labor was discovered a
long time ago, and its uniform effect is
to reduce the time it takes to get a job
done--as well as doing a better job.

For example, take the job of grading
papers for writing ability. In the school
district in which I have done the most
work on this problem, many devoted
teachers used to think they had to grade
a paper a week; others were willing to
settle for a paper every two weeks; and
no one thought he could get away with
less than a paper a month. But when we
studied this problem scientifically to determine
how many papers were necessary
to get reliable scores on eight components
of writing ability, the answer
turned out to be four papers a year, each
rated independently by two different
teachers. Every student in the three junior
high schools of this district writes a
test paper in his English class on the
same topic and on the same day during
November, January, March, and May.
Each student numbers his own paper
with any number of six digits that pops
into his head and writes no other identification
on his paper, but he copies this
number on a separate slip and adds his
name, grade, teacher, section, and the
date. These name-slips are locked up
until the rating is completed. The papers
are arranged in the numerical order of
these self-chosen numbers, which puts
them in an obviously random order, and
are then divided into as many piles as
there are teachers to rate them. Each
pile has about the same number of papers
that each teacher gets on an ordinary
homework assignment (between 120 and
150). But these test papers, planned and
written within one class period, are much
shorter than homework papers, and they
represent a much wider range in ability,
because each teacher gets papers all the
way from the top class in Grade 9 to the
bottom class in Grade 7. Since the differences
among these test papers are
much more obvious than among those
that a teacher gets from any one class,
they are quicker and easier to rate. Moreover,
teachers are forbidden to write any
comments or corrections on these papers,
because that would influence the judgment
of the second reader. They encircle
one number for each quality on
rating-slips like the one below:



Topic ________   Reader ________   Paper ________

                Low        Middle        High
Ideas            2     4     6     8     10
Organization     2     4     6     8     10
Wording          1     2     3     4      5
Flavor           1     2     3     4      5
Usage            1     2     3     4      5
Punctuation      1     2     3     4      5
Spelling         1     2     3     4      5
Handwriting      1     2     3     4      5

                                   Sum ________
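
The sum on a rating slip is simple enough to check by hand, but a short Python sketch makes the double weighting explicit. The names WEIGHTS and slip_total are illustrative assumptions, and the sketch takes each mark as a position from 1 (low) to 5 (high) on the five-point line, applying the double weight itself; on the printed slip the doubled values for ideas and organization are already written out.

```python
# Weight of each scale; ideas and organization count double.
WEIGHTS = {
    "ideas": 2,
    "organization": 2,
    "wording": 1,
    "flavor": 1,
    "usage": 1,
    "punctuation": 1,
    "spelling": 1,
    "handwriting": 1,
}


def slip_total(marks):
    """Return the weighted sum for one reader's slip.

    marks: dict mapping every scale name to a position 1..5.
    """
    total = 0
    for scale, weight in WEIGHTS.items():
        mark = marks[scale]
        if mark not in (1, 2, 3, 4, 5):
            raise ValueError(f"{scale}: mark must be 1..5, got {mark!r}")
        total += weight * mark
    return total


if __name__ == "__main__":
    middling = {scale: 3 for scale in WEIGHTS}
    print(slip_total(middling))
```

A slip marked at the low end throughout totals 10, one at the middle totals 30, and one at the high end totals 50, which agrees with the total score of 50 stated for the scales.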



How long does this take? Now that we
have developed a systematic way of
doing it, and teachers have had a good
deal of practice, the answer is an average
of two minutes a paper. We find that it
actually increases accuracy to work
rapidly, and so we encourage teachers
to trust their first impressions and rate
boldly--with confidence that any serious
error in judgment will be caught by the
second reader. They do not have to believe
that the second reader will be a
better judge but only that he is unlikely
to misjudge any given paper in the same
direction. Hence, when one rating is
far off the beam, the other rating is likely
to differ in its total by more than ten
points, and all such papers are referred
to a small committee of the most experienced
teachers for a third reading.
At present, we find that only about one
paper in 12 requires a third reading. Since
most papers get two readings at two
minutes apiece and a relatively small
number get three, the total time for
rating one of these tests of writing
ability now averages about five minutes
per student. How much time does it take
to grade, correct, and comment on the
average homework paper? Our figure is
eight minutes per paper, and this was
confirmed by a careful study under different
auspices in California. Hence, even
with the double reading, the time spent
by teachers in rating one of these writing
tests is less than they spend on a
homework assignment--and of course
there is no homework assignment during
the four weeks per year in which these
papers are written and rated.
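
The rule for a third reading, and the reading time it implies, can be written out as a small Python sketch. The function names are hypothetical, and the assumption that a third reading also takes two minutes is mine; the article gives only the ten-point threshold, the two-minute average, and the rate of about one paper in 12.

```python
def needs_third_reading(total_a, total_b, threshold=10):
    """True when two readers' totals differ by more than the threshold."""
    return abs(total_a - total_b) > threshold


def expected_minutes_per_student(minutes_per_reading=2.0,
                                 third_reading_rate=1 / 12):
    """Average reading time per paper: two readings plus an occasional third.

    Assumes (not stated in the article) that a third reading takes as long
    as a first or second one.
    """
    return (2 + third_reading_rate) * minutes_per_reading


if __name__ == "__main__":
    print(needs_third_reading(22, 34))   # totals 12 points apart
    print(needs_third_reading(30, 40))   # exactly ten points apart
    print(round(expected_minutes_per_student(), 2))
```

Under these assumptions the bare reading time comes to a little over four minutes a student. The article's figure of about five minutes is somewhat higher, and the source does not itemize the difference.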

The reliability of the cumulative total
of eight ratings on four test papers per
year normally reaches or exceeds .80.
This is lower than one wants in a controlled
experiment but high enough for
a practical judgment in the ordinary
course of schoolwork--and much higher
than one ordinarily gets. This means that
if we added a fifth test paper, it would
not change the relative position of enough
students to justify the additional time.
You can see why if you consider the
large number of rating-points that students
accumulate. The lowest possible
total for the year is 80 points; the average
is 240 points; and the highest possible
total is 400 points. This spreads the students
out so widely that an additional
rating would not change the picture
very much or in very many cases. That
is what I meant when I said that most
teachers have no idea that there is any
way to find out how much measurement
of a given objective is enough. A department
soon finds out. We stop when the
reliability of our cumulative total reaches
or exceeds .80.
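
The yearly point totals follow directly from the rating slip: eight ratings, each running from 10 to 50 with 30 at the middle. The Python sketch below reproduces them and then estimates what a fifth paper would add. The article does not say how that estimate was made; the Spearman-Brown formula used here is a standard tool for the question and is brought in as an assumption, as are the function names.

```python
RATINGS_PER_YEAR = 4 * 2                  # four test papers, two readers each
SLIP_MIN, SLIP_MID, SLIP_MAX = 10, 30, 50  # totals possible on a single slip


def yearly_range():
    """Lowest, middle, and highest cumulative totals for the year."""
    return (RATINGS_PER_YEAR * SLIP_MIN,
            RATINGS_PER_YEAR * SLIP_MID,
            RATINGS_PER_YEAR * SLIP_MAX)


def spearman_brown(reliability, factor):
    """Predicted reliability when a measure is lengthened by `factor`."""
    return factor * reliability / (1 + (factor - 1) * reliability)


if __name__ == "__main__":
    print(yearly_range())
    # Going from four papers to five lengthens the measure by 5/4.
    print(round(spearman_brown(0.80, 5 / 4), 3))
```

On this estimate a fifth paper would raise a reliability of .80 only to about .83, which fits the judgment that the additional rating time is not justified.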

Of course, we do not reduce practice
in writing to these four test papers per
year. Before each test there are at least
four homework papers that receive careful
comments--but why grade them? I
know your answer: "Because students
raise Cain if we don't." That is true, but
you should add, "under present conditions."
When the grade that enters the
record depends on the average of these
homework papers, naturally they want
to know how well they did on each one.
But when the grade depends entirely on
how well they write in four tests, they
soon regard the homework papers as
training for the tests, and then they value
tips on what they did well or badly more
highly than grades. Cutting out grades
on homework papers saves time, worry,
and arguments. Hence, even if rating the
tests took more time than a homework
assignment, it would save time in the
more difficult task of dealing with 16 to
30 homework papers per year.

You may be thinking, "But grades
based on these tests are obviously unfair.
Since papers from Grades 7, 8, and 9 are
mixed together without identification,
seventh-graders are bound to get the
lowest ratings and ninth-graders the highest."
So they do. That is why students
receive at least two general indications
of their position after each test. The first
is their position up to this point in Grades
7-8-9 combined. That is a very important
figure, because that is the one
that moves. Since there is a great deal
of natural growth in writing ability during
these grades, the average student
stands in the lowest third of this distribution
in Grade 7, the middle third in
Grade 8, and the highest third in Grade 9.
Hence, we can measure growth much
more accurately and convincingly than
by our usual practice of grading severely
at the beginning of each year and more
leniently toward the end.



The second general indication of position
is where each student stands among
other students in the same grade with
whom he may reasonably be compared:
for example, remedial, regular, or honor
students. This is more nearly like present
grades, but note that remedial students
are not forever condemned to the equivalent
of D's and F's, nor are honor students
guaranteed the equivalent of A's
or B's. We show them where they stand
in their own league, but we also know
where they stand in the total population
of the school.
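
The two indications of position can be illustrated with a short Python sketch. Everything in it is hypothetical: the record layout, the field names, the sample totals, and the choice of "percent of students scoring lower" as the measure of standing. The article does not say in what form positions were reported to students.

```python
def percent_below(score, population):
    """Percent of totals in `population` that are strictly lower than `score`."""
    if not population:
        raise ValueError("population must not be empty")
    below = sum(1 for other in population if other < score)
    return 100.0 * below / len(population)


def standings(student, records):
    """Return (standing in Grades 7-8-9 combined, standing in own group).

    Each record is a dict with keys 'total', 'grade', and 'track'
    (for example remedial, regular, or honor).
    """
    everyone = [r["total"] for r in records]
    league = [r["total"] for r in records
              if r["grade"] == student["grade"]
              and r["track"] == student["track"]]
    return (percent_below(student["total"], everyone),
            percent_below(student["total"], league))


if __name__ == "__main__":
    records = [
        {"total": 120, "grade": 7, "track": "regular"},
        {"total": 160, "grade": 7, "track": "regular"},
        {"total": 240, "grade": 8, "track": "regular"},
        {"total": 320, "grade": 9, "track": "honor"},
    ]
    print(standings(records[1], records))
```

In this made-up sample the seventh-grader with 160 points stands low in the combined population but in the upper half of his own group, which is the distinction the two figures are meant to show.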

An incidental benefit of this double
grading of unidentified papers is that it
puts the teacher and his students on the
same side of the fence. He wants all of
them to make the best possible showing
on each test, but he cannot give them a
high grade; their papers will be rated
anonymously by all members of the department.
If a student gets a lower rating
than he expects, his teacher can say quite
honestly, "I have no idea who gave you
that rating, and I have no power to
change it. But let me get your paper
and show you what you need to work
on. I'll help you, and if you work hard,
you can improve your position in the
next test." That is a refreshing change
from the present situation in which we
have to argue with some students over the
grades we "gave" them. With departmental
measures, these arguments vanish.

Now let me turn to a fourth question
about these examinations: "Will our
teaching be judged by the results?" Certainly
not. Everyone knows that some
classes are brighter, better prepared, and
more highly motivated than other classes,
and their high scores or ratings can lead
to no defensible conclusions about their
present teacher. When teachers analyze
the results of these examinations, they
first look at the kinds of questions or
tasks on which each meaningful subgroup
of the population did well or badly. For
example, in the writing tests we usually
find that students in these grades show
greater improvement in ideas, organization,
and wording than in mechanics, and
that students from disadvantaged areas
quite naturally have the most serious difficulties
with mechanics. But occasionally
we find that some of the disadvantaged
students have improved much more in
mechanics than we usually expect--even
though their scores are still low. How in
the world did they do it? Sometimes
their teachers can offer a pretty shrewd
guess. As other teachers of these students
try similar procedures, they may find a
similar improvement. Our policy is always
to look for some favorable result
and to try to discover what accounts for
it. As these successful practices are more
widely adopted, they will automatically
replace the less successful. In any case,
we do not want teachers to think of their
measurement program primarily as a way
of finding out what they are doing
wrong. We prefer to look for things
that work.

A host of other objections to cooperative
measurement are summed up in the
statement, "I don't think I'd like it." That
is quite natural, for teachers tend to be
the most conservative element in the
community, and they can be counted on
to oppose any procedure that is unfamiliar
to them; but after they get used
to it, they will defend it to the death
against any further change. One advantage
of cooperative measurement is
that it makes very little difference
whether one likes it or not. It gets sold
to the superintendent, the Board of Education,
and the principals on the ground
that no significant change has come about
in the measurement practices of individual
teachers as far back as anyone can
remember, and it is high time to adopt
a departmental approach that has power
to initiate change. The administrators
then bring together the department heads
or team leaders in an Evaluation Committee,
and in that public setting one
after another answers the questions,
"What objectives will your department
try to measure? On what dates? By what
means?" When these people agree that
a certain objective will be measured by
a certain procedure within certain dates,
no individual teacher can ignore it. The
measure is prepared by a division of labor
and administered on the scheduled dates,
and all students to whom it applies take
it. Then the papers are scored or graded
and the results analyzed by the teachers
who are given this responsibility, and a
report on these results is given at the
next meeting of the Evaluation Committee.
At no point is there an opportunity
to say, "I don't think I'd like it."
It is simply assumed that if a department
professes to be teaching something, it
has an obligation to present some sort of
evidence that that thing is being learned.

It is noteworthy that, whenever and
wherever I have initiated a departmental
measurement program, I have never
known a department head to report
failure to prepare or administer a promised
measure within the scheduled dates.
These are public commitments, motivated
in part by rivalry with other departments,
and it would seriously embarrass a department
head to report in a meeting of
his peers that his group had failed to
meet its obligations. Compare this record
with the usual result of exhorting individual
teachers to go home and improve
their measurement procedures. They may
try something once, although even that
is unusual, but thereafter they go on doing
what they have always done, and what
teachers before them have done for
generations. That might be all right if
these traditional practices were satisfactory
to teachers, students, and parents,
but we hear complaints about them on all
sides. They are maintained only by
inertia and custom. On the other hand,
in a departmental measurement program,
changes can be initiated and maintained
by the binding force of public commitments,
deadlines, and reports. The control
is democratic, but things get done.

I know that this sounds hardboiled,
and it is intended to be hardboiled, for
I am fed up with exhorting teachers to
do something intelligent about measurement
and getting nowhere. As a matter
of fact, however, as soon as teachers get
involved in cooperative measurement,
they like it. It makes the job easier,
quicker, and more interesting by a division
of labor; it puts teachers and students
on the same side of the fence; it reveals
answers to many teaching problems; it
provides ammunition against our critics;
and it adds fun and excitement to both
teaching and learning. Incidentally, it
brightens up the usual meetings of departments
or teams because the teachers
have something of real substance to work
on together, and it yields results that
they all want to discuss.

I have now dealt with five objections
to cooperative measurement:

1. It will interfere with freedom of
teaching.

2. It is disagreeable to work on examinations
with other teachers.

3. It will take too much time.

4. Teachers will be judged unfairly by
the results.

5. "I don't think I'd like it."

I PROMISED to clear these objections 
out of the way before telling you how 
to do it, but I was not quite honest. In 
the course of dealing with these objec- 
tions, I think I have given you a pretty 
clear idea of how this plan works. The 
first step is to appoint an Evaluation 
Committee, consisting of heads of de- 
partments and special services, such as 
ibrary and guidance. In the school 
district in which I have done the most 
work on this program, wc built up this 
committee gradually. In the first year it 
represented guidance (including the as- 
sistant principals with special responsi- 
bility for discipline), English (together 
with the library), and social studies. In 
the second year we added mathematics,
science, and foreign languages. In the
third year we took on the fine and
practical arts, vocational education, and
physical education. This kept us from
having to develop measures of too many
different objectives in any one year. We
were also content to start with coopera-
tive measures of even one or two ob-
jectives in each field, knowing that if we
broke the ice, other objectives would
gradually be added. The Evaluation Com-
mittee met only four times a year but
each time for a full morning, with sub-
stitutes hired to cover classes. A clean
break with the individualistic tradition
of school evaluation cannot be made by
tired people who always have to meet
after school. The real work went on
behind the scenes as committee members
met with their departments or teams in
their own schools to prepare, review, re-
vise, administer, score, report, and analyze
the results of the measures for which
they were responsible.

I hope you will not go away with the 
impression that all of these cooperative 
measures have to be something unusual, 
like the tests on literary works that all 
teachers were forbidden to discuss, or 
the four writing tests per year. The back-
bone of every school measurement pro-
gram is the "ordinary" subject-matter
examination that is given four, five, or
six times a year. I have put "ordinary"
in quotation marks because, when teach- 
ers work on these examinations together 
and expect them to provide defensible 
measures of the most important objectives 
of their program, they turn out to be 
anything but "ordinary." There may be
nothing unusual about the format, but
the questions are prepared and criticized
and the answers are scored or rated with
a very clear idea of the objectives that
are to be measured. 

Some of the other measures that we
have developed are extremely simple but
helpful to both students and teachers,
like our Record of Independent Reading,
which is kept on 3 x 5 index cards. As
soon as a student finishes a book or 
decides to give it up, he fills out one of
these cards with his name, grade, and
the date; author and title; a number in-
dicating the type of book; a rating of
how much he liked it; and an indication
of its difficulty (easy, medium, hard).
Then he writes a candid comment about
the book for the benefit of other students 
who are looking for something to read. 
In the periods reserved for independent 
reading, he sees other students using these
cards, looking for a book that their
friends have recommended with appar-
ently genuine enthusiasm. Hence the
comments are extremely candid, and
some of them curl the teacher's hair, but
there must be no reprisals or these cards 
would lose their value for other students. 
Teachers also find them useful. In pre- 
paring for a conference on reading, they 
leaf through these cards and get a pretty 
clear idea of what the student likes and 
dislikes and the types of books that he
has not yet explored. One of our most
important and most disturbing findings
also grew out of this record. We found
that there is a serious and widespread de-
cline in the number of books read in-
dependently beginning in Grade 9. The
ninth-graders turned in just two-thirds
as many book cards as the eighth-graders
in all three schools. We ran through
many glib explanations of this decline and
finally came to one that concerned us
deeply: that this is the point at which
most students have to make the transition
from juvenile to adult reading, and a sur-
prising number can't do it. As a result,
our tests on literary works focus on the 
difficulties in adult books that the average 
student cannot cope with, and we are 
making a concerted effort in our classes 
to find out how these difficulties can be 
overcome. 
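
As a rough illustration of the Record of Independent Reading described above, the card can be sketched as a small record type, together with the per-grade tally behind the Grade 9 finding. This is only a sketch in Python: the field names, the 1-to-5 liking scale, the numeric book-type codes, and the sample cards are all assumptions invented for illustration and are not taken from the original cards.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

DIFFICULTY_LEVELS = ("easy", "medium", "hard")


@dataclass(frozen=True)
class ReadingCard:
    """One 3 x 5 card, filled out when a book is finished or given up."""
    student: str
    grade: int
    filled_out: date
    author: str
    title: str
    book_type: int   # number indicating the type of book (codes are hypothetical)
    liking: int      # assumed scale: 1 (disliked) to 5 (liked very much)
    difficulty: str  # one of DIFFICULTY_LEVELS
    comment: str     # candid comment for other students

    def __post_init__(self):
        if self.difficulty not in DIFFICULTY_LEVELS:
            raise ValueError(f"difficulty must be one of {DIFFICULTY_LEVELS}")
        if not 1 <= self.liking <= 5:
            raise ValueError("liking must be between 1 and 5")


def cards_per_grade(cards):
    """Count how many cards each grade turned in."""
    return Counter(card.grade for card in cards)


# Invented sample cards, only to show how the tally works.
sample_cards = [
    ReadingCard("Student A", 8, date(1966, 10, 3), "Author One", "Book One", 1, 5, "easy", "Could not put it down."),
    ReadingCard("Student B", 8, date(1966, 10, 4), "Author Two", "Book Two", 2, 2, "hard", "Gave up halfway."),
    ReadingCard("Student C", 8, date(1966, 10, 5), "Author One", "Book One", 1, 4, "medium", "Good ending."),
    ReadingCard("Student D", 9, date(1966, 10, 3), "Author Three", "Book Three", 3, 3, "hard", "Slow start."),
    ReadingCard("Student E", 9, date(1966, 10, 6), "Author Two", "Book Two", 2, 1, "hard", "Too long."),
]

counts = cards_per_grade(sample_cards)
# Ratio of Grade 9 cards to Grade 8 cards in the invented sample.
print(counts[9] / counts[8])
```

A tally of this kind, run separately for each school, is all that is needed to notice a drop in cards between two grades.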

WITH this general background in
mind, let me turn to the preparation
and grading of essay tests and examina-
tions. Although this is the happy hunting
ground of the individual teacher, it is in
this field that I see no possibility what-
ever that an individual can improve his
measures all by himself. As for the prep-
aration of these tests, if I had to pick out
a single fault of a typical essay examina-
tion that I would condemn above all
others, it is that it depends altogether too
much on the slant (the opinions and pref-
erences, and sometimes the ignorance and
dogmatism) of a solitary individual, un-
checked by the criticism of his colleagues.
Although teachers try to be tolerant of
divergent opinions and undoubtedly wel-
come them when they are expressed
cogently by superior students, the safest
course for the average student is to give
the teacher what he wants. The only anti-
dote I know to this dominance of a single
point of view is the cooperative prepara-
tion, review, and revision of examination
topics or questions, and of the guidelines
that are to be used in grading the answers.

As for grading the essays, an indi-
vidual teacher never finds out when he is
wrong (or may be wrong) because other
teachers never disagree with him. He sees
the students every day in class and quick-
ly forms an impression of their ability,
attention, industry, thoroughness, and the
like. Then, when he reads their papers,
knowing who wrote them, he uncon-
sciously reads into the papers either
more or less than is actually there. 

This effect was prettily illustrated in 
a study conducted by Dr. Benjamin Ros-
ner a few years ago in which test papers
from 12 school districts were sent to a
central office where all identification ex-
cept a code number was removed, and
the papers were sent back in a random
order to be graded. The teachers pro-
tested that they could not grade them
fairly unless they knew at least whether
they came from regular or honors classes
because honor students should be graded
by higher standards. Dr. Rosner said
that this presented an opportunity to find
out what information about the writers
was essential to accurate grading, and he
promised to supply the information they
wanted, one bit at a time, on subsequent
papers. Hence the papers were stamped
either "Boy" or "Girl," "Grade 9" or
"Grade 10," "Regular" or "Honors," and
so on. What the teachers did not know
was that half of this information was true
and half was false. The papers had been
written on carbon-backed forms, so that
Dr. Rosner had three identical copies of
each paper. One of these was stamped
"Regular" while another copy of the
very same paper was stamped "Honors."
He made sure that no school received
both copies of the same paper, but other-
wise the papers were sent back in a ran-
dom order.

"Regular" vs. "Honors" proved to be 
the only bit of information that made 
any difference, and the effect was the
opposite of what the teachers expected.
The papers stamped "Honors" received
significantly higher grades than the other
copies of the very same papers that were
stamped "Regular." The explanation
seems to be that we find what we expect
to find. If we think a paper came from
an honors class, we expect it to be pretty
good, and that is what we find, but if
we think it came from a regular class, we
expect it to be only so-so, and that is
what we find. If a single word stamped
on a paper can have that much effect on
grades, consider how much effect the
full personality of the student must have.
That is why papers so rarely surprise us.
We keep on reading into them our im-
pression of the student that we gathered
during his first month in class. And even
when the paper does surprise or disap-
point us, we may change too little. I often
used to think, "Too bad; he had an off
day. I'm afraid I'll have to reduce his
grade to a B." But the same paper written
by a poor student might easily have re-
ceived a D or an F. 
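
The design of the stamped-copy study described above can be sketched in a few lines. The sketch below, in Python, covers only the one constraint the account states (no school receives both copies of the same paper, and each batch is otherwise in random order). The function name, the school labels, and the code numbers are hypothetical, and the sketch does not model which stamp was true for a given paper.

```python
import random


def assign_stamped_copies(paper_codes, schools, rng):
    """Send a "Regular" copy and an "Honors" copy of each paper to two different schools."""
    if len(schools) < 2:
        raise ValueError("need at least two schools to separate the copies")
    batches = {school: [] for school in schools}
    for code in paper_codes:
        # rng.sample draws two distinct schools, so no school gets both copies.
        regular_school, honors_school = rng.sample(schools, 2)
        batches[regular_school].append((code, "Regular"))
        batches[honors_school].append((code, "Honors"))
    for batch in batches.values():
        rng.shuffle(batch)  # otherwise random order within each school's batch
    return batches


# Hypothetical example: 12 receiving schools, 30 coded papers.
schools = [f"School {n}" for n in range(1, 13)]
paper_codes = list(range(101, 131))
batches = assign_stamped_copies(paper_codes, schools, random.Random(0))
```

Comparing the grades given to the two copies of each code number then isolates the effect of the stamp, since the papers themselves are identical.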

HENCE, I believe that the first step
toward the improvement of essay
grading is to find out how widely teach-
ers disagree when they all grade photo-
copies of the same paper and do not
know whose paper they are grading. In
college I used to reproduce about one pa-
per a month and have it graded and com-
mented on by all members of the de-
partment. At the beginning of each year
I never failed to get every grade from A
to F. In our next meeting I would write
on the blackboard how many gave it
an A, how many gave it a B, and so on.
Although the teachers were always
shocked, I tried not to be. I would just
say that this always happened, and the
only thing we could do about it was to
argue out our differences. Then I would
turn to some highly respected teacher
and ask him why he gave it an A. After
listening to his explanation, I would turn
to some friend of his and ask him why he
gave it an F. The curious thing was that
both teachers often saw the same things
in a paper but weighted them differently. 
One might say that there were a great 
many careless errors, but what counted 
was that the boy had something to say 
and said it rather forcibly. The other 
might reply that this was true, but when 
a student with so much natural talent 
had gone this far in school without both-
ering to learn the ordinary conventions
of writing, he gave the paper an F to 
show him that he could not get away 
with it any longer. 

That brought up a policy question: 
how should we grade a paper that had 
ideas and managed to get them across 
but contained this many mechanical er- 
rors? On the other hand, how should we 
grade a paper that was impeccable in 
mechanics but said practically nothing? 
As we argued over questions like these
(not in the abstract, but in the presence
of examples of what we were talking
about), we gradually came closer to-
gether. In judging anything as complex
as writing ability, however, I think it is
unrealistic to expect a higher average
agreement in a department than is repre-
sented by a correlation of .5. This is the
usual correlation between height and
weight. It is by no means hopeless. As I
have previously illustrated, all that is
necessary to get it up to a reliability of
.8 is four samples of each student's work,
each rated independently by two read-
ers, with a third rating for papers on
which there is substantial disagreement.
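
To see why pooling several ratings raises reliability, the standard Spearman-Brown formula can be applied to the single-rating agreement of .5 mentioned above. This is a sketch under an idealized assumption, namely that every rating agrees with every other rating at the same level. It is not necessarily the calculation behind the .8 figure in the text, whose earlier illustration may rest on different assumptions (for example, that ratings of different papers by the same student agree less closely than two readings of one paper). The function name is hypothetical.

```python
def spearman_brown(single_rating_reliability, n_ratings):
    """Reliability of the sum (or mean) of n_ratings equally reliable ratings."""
    if n_ratings < 1:
        raise ValueError("n_ratings must be at least 1")
    r = single_rating_reliability
    return n_ratings * r / (1 + (n_ratings - 1) * r)


for n in (1, 2, 4, 8):
    print(n, format(spearman_brown(0.5, n), ".3f"))
# 1 0.500
# 2 0.667
# 4 0.800
# 8 0.889
```

Under this idealized assumption four pooled ratings already reach .8, so the plan of four samples with two readers each leaves a margin for the ways in which real ratings fall short of it.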

Some teachers profess astonishment at 
the low level of agreement that I expect 
and tell me that in their department they 
hardly ever disagree on an essay grade by 
more than a plus or minus. I know how
to do that, too. One way is to put the
grade at the top of each paper and back
it up with a number of corrections and
comments in red ink; then hand the pa-
pers to some other teacher to see whether
he agrees. Of course he will. Grading is
such a suggestible process that a paper
with a B on it already begins to look
like a B paper. But, you may say, I put
my grade on the back and ask him not
to look at it until he puts his grade on
the front. I am sorry, but I cannot be-
lieve that this way of concealing the prior 
grade is very effective, because I get 
nothing like this agreement when there is 
no grade or comment written on the pa- 
pers by any teacher and when the read- 
ers do not even know who graded them 
previously. 

Another way to reach high agreement 
may be illustrated by an essay question
I remember from an examination on
Homer's Odyssey: "Write a unified essay
on the women in the Odyssey." This is
the "unstructured" type of question that
literature teachers love. It is supposed to
get at ability to organize material, in-
dependent thinking, critical insight,
originality, imagination, and other fine
qualities. But the specifications used in
grading the answers were quite different.
First, the staff made a list of about 11
women in the Odyssey that they thought
students should remember and gave five
points for each one that a student men-
tioned. But they subtracted one point for
misspelling the name, another for omit-
ting or mistaking the place where she
lived, and a third for mentioning her out
of order. Then they put down three
things about each woman that they
thought students should remember and
gave either one, two, or three points for
each one, depending on the accuracy of
the statement. At the end, they allowed
each reader to give from one to five
points for what they called "good writ-
ing." Each paper was graded quite in-
dependently by two readers, and they
boasted that the average agreement or
correlation between pairs of readers was
.80. I did not doubt it, but what about all
those fine objectives? All that they really
measured was total recall of what hap-
pened plus ability to spell some rather
difficult Greek names.
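
Writing the staff's grading specifications out as a procedure makes the criticism concrete: almost every point depends on recall, and "good writing" contributes at most five. The Python sketch below follows the point values given above. The names `Mention` and `recall_score` are hypothetical, and the sketch assumes that a remembered fact the student omits simply earns nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Mention:
    """How one listed woman was handled in a student's answer."""
    name_misspelled: bool = False
    place_wrong_or_missing: bool = False
    out_of_order: bool = False
    fact_points: list = field(default_factory=list)  # up to three facts, each worth 1-3


def recall_score(mentions, writing_points):
    """Total a paper under the staff's scheme (the one criticized in the text)."""
    if not 1 <= writing_points <= 5:
        raise ValueError("writing_points must be between 1 and 5")
    total = writing_points
    for mention in mentions:
        if len(mention.fact_points) > 3 or any(p not in (1, 2, 3) for p in mention.fact_points):
            raise ValueError("each woman has at most three facts, each worth 1 to 3 points")
        total += 5  # five points for mentioning her at all
        total -= int(mention.name_misspelled)
        total -= int(mention.place_wrong_or_missing)
        total -= int(mention.out_of_order)
        total += sum(mention.fact_points)
    return total


# A paper that recalls two women perfectly but is poorly written ...
recaller = recall_score([Mention(fact_points=[3, 3, 3]), Mention(fact_points=[3, 3, 3])], 1)
# ... outscores a well-written paper that discusses one woman with a misspelled name.
writer = recall_score([Mention(name_misspelled=True, fact_points=[3, 2])], 5)
print(recaller, writer)  # 29 14
```

With roughly a dozen listed women, each worth up to 14 points, the five points for writing can hardly affect the ranking of papers, which is consistent with the high agreement the staff reported.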

SINCE this is obviously not the best
way to grade an essay question that
has some factual content, what is a better
way? First, in the way the question is
stated, I believe that there ought to be
more "structure," for in my experience
the "unstructured" question gives more
weight to memory than we want. Even if
we are not so obvious about it as the staff
I mentioned, we are unconsciously in-
fluenced by such details as getting Circe
on the wrong island, misspelling Nau-
sicaa, or forgetting about the slave-girl
Melantho. I would give students most of
the details that this staff expected them
to remember: a list of women in the
order of their appearance, the place
where each lived, and one fact about
each one that would recall her to
their minds, such as "Circe, Aeaea,
changed men into pigs." Then I would
indicate that they were not expected to 
comment on all of them but on not more 
than five or six that would illustrate the 
points they intended to make. I would 
even go so far as to suggest some of the 
kinds of points they might make, such 
as the traits of character in Odysseus 
that these encounters brought out, the 
ways in which these women resembled or 
differed from modern women, or the 
speculation of Samuel Butler that the
prominence of women in the Odyssey
suggests that it was composed by a
woman. Of course I would indicate that
these points were intended only to illus-
trate the kinds of points they might
make, and that they should feel free to
comment on anything they noticed about
the women in the Odyssey that struck
them as interesting. 

The next thing I would like would be 
for two or three members of the com- 
mittee to write a short paper on this ques- 
tion within the time limit to be observed 
by students. Such papers may bring to 
light unusual treatments of the topic that 
might not occur to most staff members 
and that might be rejected as unsound if 
they were first encountered in student 
papers. In other words, the staff papers
may break up preconceived ideas of the
sort of essay that students ought to write.
They may also suggest some of the quali-
ties that should be looked for in superior
papers, and they may keep the younger
staff members from expecting more of
students than teachers can do. 

I would usually insist that students 
bring their copies of the Odyssey to such 
an examination, and I would have some 
extra copies for those who forgot. This
open-book policy reduces fear of the 
examination and our own reliance upon 
accuracy of recall in setting the questions 
and grading the answers. It also enables us 
to encourage students to support their 
points by relevant short quotations. I 
myself believe that even examinations on 
some portion of a textbook in history, 
science, and the like should usually be 
open-book examinations, but I can imag-
ine some situations in which this would
be unnecessary or inappropriate.

In preparation for grading the answers, 
I like two or three staff members to take 
home a number of papers and bring back
sample papers to illustrate one or more
types of good, average, and poor answers,
possibly with a few comments pointing
out the distinctive characteristics of
these papers. These may be duplicated in
photocopies and discussed in a short
meeting before the papers are distributed
for rating. If any staff member finds some
papers hard to rate because they bear no
resemblance to any of the sample papers,
he should be encouraged to discuss them
with a staff member who worked on the
selection of these samples. Usually we
insist that nothing be written by teach-
ers on any test paper, but that ratings (and
comments when necessary) be recorded
on a separate sheet, or sometimes on a
small rating-slip for each paper. These
are handed to the department head along
with the papers as soon as each reader
finishes his set. The department head
locks up the ratings in a safe place but 
hands on the papers to some other reader, 
usually selected at random, for a second 
rating. 

After this second rating, both the pa-
pers and the ratings are usually arranged
in the numerical order of the code num-
bers written on the papers, and someone
pulls out the papers on which the two 
ratings differed by more than a certain 
amount that the department will learn by 
experience. Usually it is an amount that 
will cull out not more than ten percent 
of the papers for a third reading by a 
small committee of the most experienced 
and trusted readers. Some departments 
average all three ratings; others substitute 
the third rating for whichever of the two 
previous ratings is farther from the rating 
of the committee. If the ratings are re-
corded on separate small rating-slips, I
myself like the practice of recording the 
committee rating in red on the rejected 
rating-slip and filing this slip under the 
name of that rater. This practice frightens 
teachers when they first hear about it, 
but they soon find that nothing disagree-
able happens. Everyone must expect to
have some of his ratings rejected, but
usually just two or three of the newer
members of the department accumulate a
considerable number of these "rejects."
Some time after the examination, one
member of the review committee goes
over these papers with the staff members
whose ratings were rejected and explains
why the committee thought that their
rating was too low or too high. He lis-
tens politely to anything they may have
to say in reply and agrees with their good
points but tries to correct any misunder-
standing that comes to light. There is no
reason whatever to regard these private 
sessions as a reproach, and no one else 
need know about them. The newer staff
members just have to learn the standards 
prevailing in the department, and this 
takes time and help. I can think of no 
more tactful or effective way of doing it. 
Usually these readers are brought within
reasonable distance of departmental stan-
dards within a year, and only once in
several years do we find a reader whose
judgment is so erratic that he probably
ought not to grade these essay questions
in departmental examinations. Even that
is no calamity, for there are plenty of
other things for him to do. For example,
he may be particularly good at devising
objective questions, or he may be a
superb director of plays. In these days
of team teaching, we should not expect
teachers to be equally good at everything.
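
The two-reader procedure with a third reading, as described above, can be sketched as follows. This is a minimal Python sketch and not a prescribed method: the numeric rating scale, the function names, the choice to average the two ratings that remain, and the tie rule are all assumptions added for illustration. Only the ten-percent ceiling and the two ways of using the committee rating come from the description above.

```python
def gap_for_cull_rate(rating_pairs, max_fraction=0.10):
    """Smallest gap such that papers differing by MORE than it are at most max_fraction of all papers."""
    if not rating_pairs or not 0 <= max_fraction < 1:
        raise ValueError("need at least one paper and a fraction in [0, 1)")
    gaps = sorted(abs(first - second) for first, second in rating_pairs)
    allowed = int(len(gaps) * max_fraction)  # how many papers may go to a third reading
    return gaps[len(gaps) - allowed - 1]


def reconcile(first, second, committee=None, method="average"):
    """Return (final_rating, rejected), where rejected is None, "first", or "second"."""
    if committee is None:
        return (first + second) / 2, None  # assumption: two agreeing ratings are averaged
    if method == "average":
        return (first + second + committee) / 3, None
    if method == "substitute":
        # Replace whichever earlier rating is farther from the committee's rating.
        # Assumption: on a tie the second rating is the one replaced.
        if abs(first - committee) > abs(second - committee):
            return (second + committee) / 2, "first"
        return (first + committee) / 2, "second"
    raise ValueError("method must be 'average' or 'substitute'")


# Invented ratings on an assumed 1-5 scale for ten coded papers.
pairs = [(3, 3), (4, 3), (2, 2), (5, 4), (3, 4), (1, 1), (2, 3), (4, 4), (5, 1), (3, 3)]
max_gap = gap_for_cull_rate(pairs)
third_reading = [pair for pair in pairs if abs(pair[0] - pair[1]) > max_gap]
print(max_gap, third_reading)  # 1 [(5, 1)]
print(reconcile(5, 1, committee=4, method="substitute"))  # (4.5, 'second')
```

The `rejected` value is what would be filed under the rater's name, so that the private follow-up sessions described above have something concrete to work from.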

By way of contrast, some administra- 
tors have a touching faith in what they 
call the "training" of readers by some
consultant on measurement. They some- 
times invite me to meet with their En- 
glish department between 3:00 and 4:00 
some afternoon, and in that time they 
expect me to show them how to grade 
papers in a way that will yield fabulous 
agreement. There is no magic secret that 
can be taught in one session. In that time
I can only get them to worry about the
problem, but they have to work out a
solution for themselves, and it takes a
long time. If I became a department head,
I should expect it to take about three 
years before we could establish reason- 
ably uniform standards in grading even 
those few examinations that we all
worked on together. Even these standards
can hardly be regarded as a straitjacket.
Remember that in rating anything that
is very complex, I expect only as much
agreement between two independent rat- 
ings as we usually find between height 
and weight. If even that amount of agree-
ment seems unduly restrictive to some of
you proud, independent spirits, I won-
der why we should pretend to be able
to teach anything like good writing if no
two of us can agree even this much on 
what it is. 



